
5.3 Q-BERT: Hessian-Based Ultra Low-Precision Quantization of BERT

Shen et al. [209] propose Q-BERT, a low-precision uniform quantization method that exploits second-order Hessian information. In particular, it introduces a Hessian-based mixed-precision method and a new group-wise quantization scheme.

5.3.1 Hessian-Based Mixed-Precision

Because different encoder layers attend to different structures and exhibit different sensitivities to quantization [45], the authors argue that assigning the same number of bits to all layers is sub-optimal. They therefore explore mixed-precision quantization, in which more bits are assigned to the more sensitive layers to retain performance. A previous method, Hessian AWare Quantization (HAWQ) [59], determines the mixed-bit assignment for each layer. Its main idea is that parameters in layers with a higher Hessian spectrum (i.e., larger top eigenvalues) are more sensitive to quantization and require higher precision than layers with a smaller Hessian spectrum. However, each encoder layer in a transformer-based model contains a large number of parameters, e.g., 7M, so the Hessian of a layer is a 7M × 7M matrix with on the order of 10^13 entries, and directly computing such second-order statistics is infeasible. Instead, the authors adopt a matrix-free power iteration method [270] to estimate the Hessian spectrum; it requires only Hessian-vector products and never forms the Hessian explicitly. The power iteration yields the top eigenvalues, which serve as the indicator of a layer's sensitivity. The previous method [59] uses the top eigenvalues averaged over different training data as this indicator: more aggressive quantization is applied to layers with smaller top eigenvalues, which correspond to a flatter loss landscape.
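
The sketch below shows how such a matrix-free power iteration can be implemented with PyTorch autograd, using only Hessian-vector products; the function names, iteration count, and tolerance are illustrative choices and are not taken from the Q-BERT implementation.

```python
import torch


def _normalize(vs):
    """Normalize a list of tensors viewed as one long vector."""
    norm = torch.sqrt(sum((v * v).sum() for v in vs))
    return [v / (norm + 1e-12) for v in vs]


def hessian_top_eigenvalue(loss, params, n_iters=100, tol=1e-4):
    """Estimate the top Hessian eigenvalue of `loss` w.r.t. `params`
    by power iteration, using only Hessian-vector products."""
    # First-order gradients; keep the graph so we can differentiate again.
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Random unit starting vector with the same shapes as the parameters.
    v = _normalize([torch.randn_like(p) for p in params])

    eigenvalue = None
    for _ in range(n_iters):
        # Hessian-vector product: differentiating (grads . v) gives H v,
        # without ever materializing H.
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        # Rayleigh quotient v^T H v approximates the dominant eigenvalue.
        new_eig = sum((h * vi).sum() for h, vi in zip(hv, v)).item()
        v = _normalize(list(hv))
        if eigenvalue is not None and abs(new_eig - eigenvalue) <= tol * (abs(eigenvalue) + 1e-12):
            break
        eigenvalue = new_eig
    return eigenvalue
```

For a given encoder layer, `params` would be that layer's weight tensors (which must require gradients) and `loss` the task loss on a mini-batch; repeating the estimate over several disjoint data chunks yields the per-layer eigenvalue distribution λ_i used in Eq. (5.9) below.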

However, the authors find that assigning bits based only on the average top eigenvalue is insufficient for many NLP tasks, because the top Hessian eigenvalues of some layers exhibit very high variance across different portions of the input dataset. To address this, the following metric is adopted instead of the mean alone:

Ω_i = |mean(λ_i)| + std(λ_i),        (5.9)

where λ_i is the distribution of the top eigenvalues of the Hessian of layer i, computed on 10% of the training dataset. Once Ω_i has been computed for every layer, the values are sorted in descending order and used as a relative metric to determine the quantization precision, so that layers with larger Ω_i receive more bits. Quantization-aware finetuning is then performed with the selected precision setting. The eigenvalue distributions for various datasets are shown in Fig. 5.5.
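
As a concrete illustration, the following sketch computes the sensitivity metric of Eq. (5.9) from per-chunk top-eigenvalue estimates and ranks the layers by it. The layer names and eigenvalue lists are hypothetical, and the final mapping from rank to bit-width is omitted, since the section only describes Ω_i as a relative criterion.

```python
import statistics


def layer_sensitivity(top_eigenvalues):
    """Omega_i = |mean(lambda_i)| + std(lambda_i), as in Eq. (5.9)."""
    return abs(statistics.mean(top_eigenvalues)) + statistics.stdev(top_eigenvalues)


# Hypothetical top-eigenvalue estimates for three encoder layers, each
# measured on a different 10% chunk of the training data.
eigenvalues_per_layer = {
    "layer_0": [61.0, 70.2, 58.4, 65.1],
    "layer_5": [12.3, 11.8, 30.5, 9.9],   # high variance across chunks
    "layer_11": [2.1, 2.4, 1.9, 2.2],
}

omega = {name: layer_sensitivity(vals) for name, vals in eigenvalues_per_layer.items()}

# Sort layers by descending sensitivity; more sensitive layers keep more bits.
ranking = sorted(omega, key=omega.get, reverse=True)
print(ranking)  # ['layer_0', 'layer_5', 'layer_11']
```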

5.3.2 Group-Wise Quantization

For BERT-base, the dimension of each input token is 768, and each multi-head self-attention layer has 4 dense matrices (the key, query, value, and output projections). Directly quantizing these 4 matrices as a single entity with one shared quantization range can significantly degrade accuracy, since they contain more than 2M parameters in total and the weights corresponding to different neurons may lie in very different ranges of full-precision values. In CNNs, channel-wise quantization alleviates this problem because each convolutional kernel can be treated as a single output channel with its own quantization range. However, each dense matrix in a transformer-based model corresponds to only a single kernel, so channel-wise quantization cannot be directly applied. Therefore, the authors propose group-wise quantization for attention-based models. In particular, each individual matrix W with respect to each head in one